Introduction to R and Bioconductor for the analysis of NGS data

Pascal Martin

10 octobre 2018

Outline

  1. What is NGS? Why BioC for NGS?
  2. Working with sequences
  3. Working with aligned reads
  4. Working with ranges
  5. Annotations
  6. Import / Export
  7. Visualization

Outline

  1. What is NGS? Why BioC for NGS?
  2. Working with sequences
  3. Working with aligned reads
  4. Working with ranges
  5. Annotations
  6. Import / Export
  7. Visualization

NGS development

NGS basics

How does NGS work?

Illumina sequencing: bridge PCR (illustrations from atdbio)

Illumina sequencing

Illustrations from NMBU

Other technologies?

Increased throughput

… and decreased prices

Bioconductor

Bioconductor and the NGS workflow

Bioconductor and the NGS workflow

Specialized packages for about anything…

Outline

  1. What is NGS? Why BioC for NGS?
  2. Working with sequences
  3. Working with aligned reads
  4. Working with ranges
  5. Annotations
  6. Import / Export
  7. Visualization

The fasta format

Biostrings containers and accessors

library(Biostrings)

Containers:

“Masked” sequences are also supported (see ?masks)

Manipulation:

Importing sequences from a fasta file

dm3_upstream_filepath <- system.file("extdata","dm3_upstream2000.fa.gz",
                                    package="Biostrings")
dm3_upstream <- readDNAStringSet(dm3_upstream_filepath)
dm3_upstream
##   A DNAStringSet instance of length 26454
##         width seq                                      names               
##     [1]  2000 GTTGGTGGCCCACCAGTGC...GTTTACCGGTTGCACGGT NM_078863_up_2000...
##     [2]  2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201794_up_2...
##     [3]  2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201795_up_2...
##     [4]  2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201796_up_2...
##     [5]  2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201797_up_2...
##     ...   ... ...
## [26450]  2000 ATTTACAAGACTAATAAAG...ATTAAATTTCAATAAAAC NM_001111010_up_2...
## [26451]  2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001015258_up_2...
## [26452]  2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001110997_up_2...
## [26453]  2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001276245_up_2...
## [26454]  2000 CGTATGTATTAGTTAACTC...AAGTGTAAGAACAAATTG NM_001015497_up_2...
dm3_upstream[[5]]
##   2000-letter "DNAString" instance
## seq: TTATTTATGTAGGCGCCCGTTCCCGCAGCCAAAG...ATTAATCGATAGATACGGAAAGTCATCCTCGAT

Practice with your own DNA sequence

Like LETTERS in base R, the Biostrings package provides a DNA_ALPHABET.

Note that masks can also be associated to Biostrings and BSgenome objects

Working with large fasta files

The Rsamtools package provides function to work with large fasta file(s).
The FaFile function creates a reference to an indexed fasta file (see ?FaFile).

This is particularly useful to extract sequences within a fasta file:

library(Rsamtools)
fl <- system.file("extdata", "ce2dict1.fa", package="Rsamtools",
                  mustWork=TRUE)
fa <- open(FaFile(fl))
seqinfo(fa)
## Seqinfo object with 5 sequences from an unspecified genome:
##   seqnames  seqlengths isCircular genome
##   pattern01         18         NA   <NA>
##   pattern02         25         NA   <NA>
##   pattern03         24         NA   <NA>
##   pattern04         24         NA   <NA>
##   pattern05         25         NA   <NA>
getSeq(fa, GRanges("pattern05:11-20"))
##   A DNAStringSet instance of length 1
##     width seq                                          names               
## [1]    10 TTTGGTGGTA                                   pattern05

Whole genome sequences in BSgenome packages

library(BSgenome.Dmelanogaster.UCSC.dm3)
names(Dmelanogaster)[1:5]
## [1] "chr2L" "chr2R" "chr3L" "chr3R" "chr4"
Dmelanogaster$chr2L
##   23011544-letter "DNAString" instance
## seq: CGACAATGCACGACAGAGGAAGCAGAACAGATAT...TATTTGCAAATTTTGATGAACCCCCCTTTCAAA

For a masked version of the genome, see:

library(BSgenome.Dmelanogaster.UCSC.dm3.masked)

For adding SNPs info see:

library(BSgenome)
?available.SNPs

Pattern matching

matchPattern("KATCGATA", dm3_upstream[[592]], fixed=FALSE)
##   Views on a 2000-letter DNAString subject
## subject: TCCCAAATTAACTAATGGCTTTTCACGCAGAT...GCCTCACTTTTGTCCACATCTTTTGAAAGGC
## views:
##     start end width
## [1]    72  79     8 [GATCGATA]
## [2]   512 519     8 [TATCGATA]
## [3]   518 525     8 [TATCGATA]

K is G or T, see IUPAC code

Other functions to search for patterns: matchProbePair, findPalindromes, …

vmatchPattern('TATCGATA', Dmelanogaster)

Position weight matrix (PWM)

Probabilistic description of short sequences largely used for TF binding sites

Get a motif (as a PFM):

EcRMotif <- MotifDb::query(MotifDb,"EcR")[1]

seqLogo representation:

Convert to PWM:

EcRpfm <- apply(reverseComplement(EcRMotif[[1]]) *
                    as.integer(mcols(EcRMotif)$sequenceCount),
                2, as.integer)
rownames(EcRpfm) <- rownames(EcRMotif[[1]])
EcRpwm <- PWM(EcRpfm)
PWM for EcR motif
1 2 3 4 5 6 7 8
A 0.11 0.13 0.00 0.00 0.00 0.00 0.13 0.10
C 0.00 0.00 0.00 0.05 0.00 0.13 0.00 0.00
G 0.11 0.07 0.13 0.13 0.00 0.00 0.00 0.00
T 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.12

Scanning a sequence with a PWM

Scanning a sequence with a PWM

Search the motif in chr4 (‘+’ strand only):

EcRHits <- matchPWM(EcRpwm, Dmelanogaster$chr4)
length(EcRHits)
## [1] 2029
EcRHits[1:2]
##   Views on a 1351857-letter DNAString subject
## subject: GAATTCGCGTCCGCTTACCCATGTGCCTGTGG...TAAAAGCAGCCGTCGATTTGAGATATATGAA
## views:
##     start end width
## [1]   255 262     8 [AGGGTGAT]
## [2]   308 315     8 [AAGGGCAT]

For minus strand: use the reverseComplement of PWM

Motif pipeline example

A typical pipeline:

More sequence tools:

Other packages to work with motifs:
- MotifDb
- seqLogo
- TFBSTools
- rGADEM
- PWMEnrich
- BCRANK
- MotIV
- …

For database queries (+ other tools for AA sequences):
seqinr

Sequence alignment

pairwiseAlignment is the core function:

pairwiseAlignment('CTTGCAGTGGTGTATTCATAC',
                  dm3_upstream[[1]],
                  type='global-local')
## Global-Local PairwiseAlignmentsSingleSubject (1 of 1)
## pattern:   [1] CTTGCAGTGG-TGTATTCATAC 
## subject: [713] CTTGCAGTGGGTGTAT--ATAC 
## score: 5.653368

Levensthein edit distance:

stringDist(c("lazy", "HaZy", "crAzY"))
##   1 2
## 2 2  
## 3 4 5
stringDist(c("lazy", "HaZy", "crAzY"), ignoreCase = TRUE)
##   1 2
## 2 1  
## 3 2 2

Exercise: compare indexes

Consider a sequencing run with 10 multiplexed samples. We have the following 16 indexes available.

indx <- DNAStringSet( c("ATCACG", "CGATGT", "TTAGGC", "TGACCA",
                        "ACAGTG", "GCCAAT", "CAGATC", "ACTTGA",
                        "GATCAG", "TAGCTT", "GGCTAC", "CTTGTA",
                        "CGGCTA", "TCCGCG", "ATGTCA", "AGCGAT"))

Fastq format

\(Q = -10*{\log_{10}(P)}\) <=> \(P = 10^{-\frac{Q}{10}}\)

Working with fastq files

The ShortRead package (M. Morgan et al. 2009)

Import a fastq file with 20K reads:

fq1_path <- system.file(package="ShortRead","extdata","E-MTAB-1147",
                        "ERR127302_1_subset.fastq.gz")
myFastq <- readFastq(fq1_path)

Explore with:

myFastq
## class: ShortReadQ
## length: 20000 reads; width: 72 cycles
myFastq[1:3]
## class: ShortReadQ
## length: 3 reads; width: 72 cycles

Working with fastq files

head(sread(myFastq), 2)
##   A DNAStringSet instance of length 2
##     width seq
## [1]    72 GTCTGCTGTATCTGTGTCGGCTGTCTCGCGG...CAATGAAGGCCTGGAATGTCACTACCCCCAG
## [2]    72 CTAGGGCAATCTTTGCAGCAATGAATGCCAA...GTGGCTTTTGAGGCCAGAGCAGACCTTCGGG
head(quality(myFastq), 2)
## class: FastqQuality
## quality:
##   A BStringSet instance of length 2
##     width seq
## [1]    72 HHHHHHHHHHHHHHHHHHHHEBDBB?B:BBG...FEFBDBD@DDECEE3>:?;@@@>?=BAB?##
## [2]    72 IIIIHIIIGIIIIIIIHIIIIEGBGHIIIIH...IHIIIHIIIIIGIIIEGIIGBGE@DDGGGIG
head(id(myFastq), 2)
##   A BStringSet instance of length 2
##     width seq
## [1]    53 ERR127302.8493430 HWI-EAS350_0441:1:34:16191:2123#0/1
## [2]    53 ERR127302.21406531 HWI-EAS350_0441:1:88:9330:2587#0/1
encoding(quality(myFastq))[seq(1,51,by=2)]
##  !  #  %  '  )  +  -  /  1  3  5  7  9  ;  =  ?  A  C  E  G  I  K  M  O  Q 
##  0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 
##  S 
## 50
alphabet(sread(myFastq))[1:4]
## [1] "A" "C" "G" "T"

Large fastq files

Functions FastqSampler and FastqStreamer
Count the reads in a fastq file:

nr_myFastq <- 0
strm <- FastqStreamer(fq1_path,1000)
repeat {
 ## Get FASTQ chunk:
  fq <- yield(strm)
  if (length(fq) == 0)
   break
 ## Do something on the chunk:
  nr_myFastq <- nr_myFastq + length(fq)
}
close(strm) #close the connection
nr_myFastq
## [1] 20000

QC on fastq files

fastqc from Babraham

Several tools available in R/BioC: ShortRead qa/qa2 functions, qrqc, seqTools, Rqc

Run library(Rqc) on a fastq file:

rqcResultSet <- rqcQA(fq1_path, sample=TRUE)
rqcCycleQualityPlot(rqcResultSet)

rqcCycleBaseCallsLinePlot(rqcResultSet)

Filtering a fastq file

Define some filters:

max1N <- nFilter(threshold=1L) #No 'Ns' in the reads
goodq <- srFilter(function(x){apply(as(quality(x),"matrix"),
                            1,median,na.rm=T)>=30},
                 name="MedianQualityAbove30")
myFilter <- compose(max1N,goodq) #combine filters

Filtering a fastq file

Create a function to filter and trim the reads:

FilterAndTrim <- function(fl,destination=sprintf("%s_filtered",fl))
{
  stream <- FastqStreamer(fl) ## open input stream
  on.exit(close(stream))
  repeat {
    ###get fastq chunk  
    fq <- yield(stream)
    if (length(fq)==0)
      break
    ###TRIM first 4 and last 2 bases
    fq <- narrow(fq,start=5,end=70)
    ###FILTER
    fq <- fq[myFilter(fq)]
    ###write filtered fastq
    writeFastq(fq, destination, mode="a")
  }
}

Apply the function:

FilterAndTrim(fqFiles[1],
              destination=file.path(getwd(),"FilteredFastq.fq"))

Outline

  1. What is NGS? Why BioC for NGS?
  2. Working with sequences
  3. Working with aligned reads
  4. Working with ranges
  5. Annotations
  6. Import / Export
  7. Visualization

Alignment of NGS reads

R packages: Rbowtie, Rbowtie2, QuasR (Gaidatzis et al. 2015), Rsubread (Liao, Smyth, and Shi 2013; Liao, Smyth, and Shi 2014)

Mapping quality scores (MAQ aligner in Li, Ruan, and Durbin 2008):
- base qualities (Phred scores)
- mismatches/indels
- repetitions
- paired reads

See also Wikipedia’s list of alignment tools

SAM/BAM format

SAM format specifications
Useful tools: samtools, Picard tools

BAM file import

library(Rsamtools);library(GenomicAlignments)
library(pasillaBamSubset)
sr <- untreated1_chr4() #single-end
pr <- untreated3_chr4() #paired-end

Use samtools to index the file

indexBam(sr_bamFile)

All functions from samtools are available with R (e.g. sortBam, countBam, filterBam, mergeBam, etc.)

BAM file import

Define what to import

which=GRanges(seqnames="chr4",
              ranges=IRanges(c(75000,1190000),
                             c(85000,1203000)),
              strand="*")
what = c("rname","strand","pos","qwidth","seq")
flag=scanBamFlag(isDuplicate=FALSE)
param=ScanBamParam(which=which,what=what,flag=flag)

Import single-end reads

srbam <- readGAlignments(sr,param=param)
srbam[1:2]
## GAlignments object with 2 alignments and 5 metadata columns:
##       seqnames strand             cigar    qwidth     start       end
##          <Rle>  <Rle>       <character> <integer> <integer> <integer>
##   [1]     chr4      + 21M13615N50M55N4M        75     72990     86734
##   [2]     chr4      - 4M13615N50M55N21M        75     73007     86751
##           width     njunc |    rname   strand       pos    qwidth
##       <integer> <integer> | <factor> <factor> <integer> <integer>
##   [1]     13745         2 |     chr4        +     72990        75
##   [2]     13745         2 |     chr4        -     73007        75
##                           seq
##                <DNAStringSet>
##   [1] AAAAACTGCA...CGTAGCCACA
##   [2] ATACCTGTGA...TGGACGGCTG
##   -------
##   seqinfo: 8 sequences from an unspecified genome

Import paired-end reads

prbam <- readGAlignmentPairs(pr)
prbam[1:2]
## GAlignmentPairs object with 2 pairs, strandMode=1, and 0 metadata columns:
##       seqnames strand :     ranges --       ranges
##          <Rle>  <Rle> :  <IRanges> --    <IRanges>
##   [1]     chr4      + : [169, 205] -- [ 326,  362]
##   [2]     chr4      + : [943, 979] -- [1086, 1122]
##   -------
##   seqinfo: 8 sequences from an unspecified genome

There is also a BamFile function to create a reference to a large BAM file without importing it (as for FaFile).
See also the GenomicFiles package for working on many, large files

Outline

  1. What is NGS? Why BioC for NGS?
  2. Working with sequences
  3. Working with aligned reads
  4. Working with ranges
  5. Annotations
  6. Import / Export
  7. Visualization

Working with genomic ranges

Packages IRanges & GenomicRanges (Lawrence et al. 2013)
See also the Introduction by Martin Morgan, 2014

IRanges Definition

A simple IRanges:

eg = IRanges(start = c(1, 10, 20),
              end = c(4, 10, 19),
              names = c("A", "B", "C"))
eg
## IRanges object with 3 ranges and 0 metadata columns:
##         start       end     width
##     <integer> <integer> <integer>
##   A         1         4         4
##   B        10        10         1
##   C        20        19         0

IRanges

A bigger IRanges:

set.seed(123) #For reproducibility
start = floor(runif(10000, 1, 1000))
width = floor(runif(10000, 0, 100))
ir = IRanges(start, width=width)
ir
## IRanges object with 10000 ranges and 0 metadata columns:
##               start       end     width
##           <integer> <integer> <integer>
##       [1]       288       318        31
##       [2]       788       819        32
##       [3]       409       495        87
##       [4]       883       914        32
##       [5]       940       951        12
##       ...       ...       ...       ...
##    [9996]       466       492        27
##    [9997]       899       983        85
##    [9998]       114       124        11
##    [9999]       571       595        25
##   [10000]       900       965        66

IRanges

Vector-like behavior: length, [
Accessors: start, end, width, names

length(ir)
## [1] 10000
width(ir[1:4])
## [1] 31 32 87 32
names(eg)
## [1] "A" "B" "C"

IRanges

Some useful functions:

mid(ir[1:4])
## [1] 303 803 452 898
successiveIRanges(width=rep(10,3),gap=10)
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1        10        10
##   [2]        21        30        10
##   [3]        41        50        10
tile(ir[1:2],n=2)
## IRangesList of length 2
## [[1]]
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       288       302        15
##   [2]       303       318        16
## 
## [[2]]
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       788       803        16
##   [2]       804       819        16

IRangesList - List objects

irl  <- split(ir,width(ir)) # an IRangesList
irl[[1]][1:3]
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       321       320         0
##   [2]       600       599         0
##   [3]       184       183         0
length(irl)
## [1] 100
head(elementNROWS(irl))
##   0   1   2   3   4   5 
##  96  83 108  95  84 110

S4Vectors List objects

List = list with all elements of the same type. See ?List

start(irl)[1:2]
## IntegerList of length 2
## [["0"]] 321 600 184 297 276 816 87 729 ... 858 52 85 188 308 289 936 669
## [["1"]] 915 576 706 235 678 647 451 138 ... 638 66 979 740 300 869 433 645
log(start(irl)[1:2])
## NumericList of length 2
## [["0"]] 5.77144112313002 6.39692965521615 ... 6.50578406012823
## [["1"]] 6.81892406527552 6.35610766069589 ... 6.46925031679577

Genomic ranges

Package GenomicRanges:

library(GenomicRanges)

GRanges

A typical use case:

library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
(gr <- exons(txdb))
## GRanges object with 76920 ranges and 1 metadata column:
##            seqnames           ranges strand |   exon_id
##               <Rle>        <IRanges>  <Rle> | <integer>
##       [1]     chr2L     [7529, 8116]      + |         1
##       [2]     chr2L     [8193, 8589]      + |         2
##       [3]     chr2L     [8193, 9484]      + |         3
##       ...       ...              ...    ... .       ...
##   [76918] chrUextra [523024, 523048]      - |     76918
##   [76919] chrUextra [523024, 523086]      - |     76919
##   [76920] chrUextra [523060, 523086]      - |     76920
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome

GRanges


Operations on GRanges are generally seqnames-aware and strand-aware (see argument ignore.strand)

GRangesList

(grl <- exonsBy(txdb,by="gene"))
## GRangesList object of length 15682:
## $FBgn0000003 
## GRanges object with 1 range and 2 metadata columns:
##       seqnames             ranges strand |   exon_id   exon_name
##          <Rle>          <IRanges>  <Rle> | <integer> <character>
##   [1]    chr3R [2648220, 2648518]      + |     45123        <NA>
## 
## $FBgn0000008 
## GRanges object with 13 ranges and 2 metadata columns:
##        seqnames               ranges strand | exon_id exon_name
##    [1]    chr2R [18024494, 18024531]      + |   20314      <NA>
##    [2]    chr2R [18024496, 18024713]      + |   20315      <NA>
##    [3]    chr2R [18024938, 18025756]      + |   20316      <NA>
##    ...      ...                  ...    ... .     ...       ...
##   [11]    chr2R [18059821, 18059938]      + |   20328      <NA>
##   [12]    chr2R [18060002, 18060339]      + |   20329      <NA>
##   [13]    chr2R [18060002, 18060346]      + |   20330      <NA>
## 
## ...
## <15680 more elements>
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome

Intra-ranges operations

See ?'intra-range-methods'


Also reflect, narrow and threebands, restrict and trim

Inter-ranges operations

See ?'inter-range-methods'

Set operations - Nearest methods

See ?'setops-methods'


See also ?'nearest-methods' including nearest, precede, follow and distance,

Between-range operations / Overlaps

Q: Number of TSS located at >500bp from another gene?

Get all genes and transcrits:

Dmg <- genes(txdb) 
Dmt <- transcriptsBy(txdb,by="gene")

Get all TSS:

Dm_tss <- unlist(reduce(promoters(Dmt,up=0,down=1),min.gap=0L))

Proportion of TSS overlapping with more than 1 gene +/- 500bp:

mean(countOverlaps(Dm_tss,Dmg+500) > 1) #!strand-aware
## [1] 0.2509943
mean(countOverlaps(Dm_tss,Dmg+500,ignore.strand=T) > 1)
## [1] 0.5167952

Overlaps

Obtaining the overlaps:

fov <- findOverlaps(Dm_tss,Dmg+500,ignore.strand=T) ; fov[1:3]
## Hits object with 3 hits and 0 metadata columns:
##       queryHits subjectHits
##       <integer>   <integer>
##   [1]         1           1
##   [2]         1        1383
##   [3]         2           2
##   -------
##   queryLength: 20869 / subjectLength: 15682

Two genes on opposite strands that are overlapping:

Dmg[subjectHits(fov)[queryHits(fov)==1]]
## GRanges object with 2 ranges and 1 metadata column:
##               seqnames             ranges strand |     gene_id
##                  <Rle>          <IRanges>  <Rle> | <character>
##   FBgn0000003    chr3R [2648220, 2648518]      + | FBgn0000003
##   FBgn0011904    chr3R [2648685, 2648757]      - | FBgn0011904
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome

Q: How many reads in srbam overlap with gene FBgn0002521?

length(subsetByOverlaps(srbam,Dmg["FBgn0002521"]))
## [1] 358

Q: How many reads in srbam overlap with exons of FBgn0002521?

length(srbam[overlapsAny(srbam,grl[["FBgn0002521"]])])
## [1] 346

Counting reads mapping on features

Reads mapping on exons:

ctex <- summarizeOverlaps(features = grl[seqnames(Dmg)=="chr4"],
                             reads = srbam,
                              mode = Union)

head(assays(ctex)$counts)
##             reads
## FBgn0002521   346
## FBgn0004607     0
## FBgn0004859     4
## FBgn0005558     0
## FBgn0005561     0
## FBgn0005666     0

SummarizedExperiment

(Huber et al. 2015)

Rle

srbam <- readGAlignments(sr)
(covr <- coverage(srbam))
## RleList of length 8
## $chr2L
## integer-Rle of length 23011544 with 1 run
##   Lengths: 23011544
##   Values :        0
## 
## $chr2R
## integer-Rle of length 21146708 with 1 run
##   Lengths: 21146708
##   Values :        0
## 
## $chr3L
## integer-Rle of length 24543557 with 1 run
##   Lengths: 24543557
##   Values :        0
## 
## $chr3R
## integer-Rle of length 27905053 with 1 run
##   Lengths: 27905053
##   Values :        0
## 
## $chr4
## integer-Rle of length 1351857 with 122061 runs
##   Lengths:  891   27    5   12   13   45 ...    3  106   75 1600   75 1659
##   Values :    0    1    2    3    4    5 ...    6    0    1    0    1    0
## 
## ...
## <3 more elements>

Average profile on gene bodies

Genes on chromosome 4:

gn4 <- Dmg[seqnames(Dmg)=="chr4"]

Extract gene level profiles:

profgn4 <- covr[gn4]
profgn4[strand(gn4)=="-"] <- lapply(profgn4[strand(gn4)=="-"],rev)
names(profgn4) <- names(gn4) ; profgn4[1:2]
## RleList of length 2
## $FBgn0002521
## integer-Rle of length 9178 with 825 runs
##   Lengths: 86  5 26  1 10  2 13 13  5  5 ... 29  1  1  1  1  1 11  3  4 13
##   Values :  0  1  2  4  5  6  7  8  9  8 ...  8  7  6  7  8  7  6  5  6  5
## 
## $FBgn0004607
## integer-Rle of length 37983 with 41 runs
##   Lengths: 11762    75  1336    15    60 ...    94    75  1986    75   829
##   Values :     0     1     0     1     2 ...     0     1     0     1     0

Extract the first 1Kb as a matrix:

profgn4 <- profgn4[elementNROWS(profgn4)>=1000]
profgn4 <- as(lapply(profgn4,window,1,1000),"RleList")
mat1kb <- matrix(as.numeric(unlist(profgn4, use.names=F)),
                 nrow=length(profgn4), byrow=T,
                 dimnames=list(names(profgn4),NULL))
mat1kb <- mat1kb[rowSums(mat1kb)>0,]

Average profile on gene bodies

Plot the average profile:

df1Kb <- data.frame(Coordinate=1:1000,
                    Coverage=apply(mat1kb,2,mean,na.rm=T,trim=0.03))
ggplot(df1Kb,aes(x=Coordinate,y=Coverage))+
  geom_line()

Extract sequences in a BSgenome using a GRanges

getSeq(Dmelanogaster,gn4[1:2])
##   A DNAStringSet instance of length 2
##     width seq                                          names               
## [1]  9178 ATCGAATACCCATGCCAAACA...ATAAAAGTACGTTAACAGCA FBgn0002521
## [2] 37983 CAGCTCAGTCGAAAAAAAACG...AACGTACATTTATACGTCCT FBgn0004607
Views(Dmelanogaster,gn4[1:2])
## BSgenomeViews object with 2 views and 1 metadata column:
##               seqnames             ranges strand                       dna
##                  <Rle>          <IRanges>  <Rle>            <DNAStringSet>
##   FBgn0002521     chr4 [1193094, 1202271]      - [ATCGAATACC...GTTAACAGCA]
##   FBgn0004607     chr4 [ 522436,  560418]      + [CAGCTCAGTC...TATACGTCCT]
##               |     gene_id
##               | <character>
##   FBgn0002521 | FBgn0002521
##   FBgn0004607 | FBgn0004607
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome

Outline

  1. What is NGS? Why BioC for NGS?
  2. Working with sequences
  3. Working with aligned reads
  4. Working with ranges
  5. Annotations
  6. Import / Export
  7. Visualization

Annotations

Annotations

select(org.Dm.eg.db,
       keys=c('FBgn0015664','FBgn0015602'),keytype="FLYBASE",
       columns=c('SYMBOL','UNIGENE','ENTREZID','FLYBASECG'))
##       FLYBASE  SYMBOL UNIGENE ENTREZID FLYBASECG
## 1 FBgn0015664    Dref Dm.7169    34328    CG5838
## 2 FBgn0015602 BEAF-32 Dm.9114    36645   CG10159

AnnotationHub

library(AnnotationHub)
hub <- AnnotationHub()
length(hub) # >43500 datasets
unique(hub$dataprovider)
head(unique(hub$species))
head(unique(ah$rdataclass))

Outline

  1. What is NGS? Why BioC for NGS?
  2. Working with sequences
  3. Working with aligned reads
  4. Working with ranges
  5. Annotations
  6. Import / Export
  7. Visualization

Import/Export

Outline

  1. What is NGS? Why BioC for NGS?
  2. Working with sequences
  3. Working with aligned reads
  4. Working with ranges
  5. Annotations
  6. Import / Export
  7. Visualization

Visualization

Conclusions



Other useful resources



Major Bioconductor contributors







Thank you for your attention!

References

Brenner, S., M. Johnson, J. Bridgham, G. Golda, D. H. Lloyd, D. Johnson, S. Luo, et al. 2000. “Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays.” Nat. Biotechnol. 18 (6): 630–34.

Davis, S., and P. S. Meltzer. 2007. “GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor.” Bioinformatics 23 (14): 1846–7.

Dobin, A., C. A. Davis, F. Schlesinger, J. Drenkow, C. Zaleski, S. Jha, P. Batut, M. Chaisson, and T. R. Gingeras. 2013. “STAR: ultrafast universal RNA-seq aligner.” Bioinformatics 29 (1): 15–21.

Durinck, S., Y. Moreau, A. Kasprzyk, S. Davis, B. De Moor, A. Brazma, and W. Huber. 2005. “BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis.” Bioinformatics 21 (16): 3439–40.

Durinck, S., P. T. Spellman, E. Birney, and W. Huber. 2009. “Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt.” Nat Protoc 4 (8): 1184–91.

Gaidatzis, D., A. Lerch, F. Hahne, and M. B. Stadler. 2015. “QuasR: quantification and annotation of short reads in R.” Bioinformatics 31 (7): 1130–2.

Gentleman, Robert C, Vincent J. Carey, Douglas M. Bates, and others. 2004. “Bioconductor: Open Software Development for Computational Biology and Bioinformatics.” Genome Biology 5: R80. http://genomebiology.com/2004/5/10/R80.

Huber, W., V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, et al. 2015. “Orchestrating high-throughput genomic analysis with Bioconductor.” Nat. Methods 12 (2): 115–21.

Kim, D., B. Langmead, and S. L. Salzberg. 2015. “HISAT: a fast spliced aligner with low memory requirements.” Nat. Methods 12 (4): 357–60.

Kim, D., G. Pertea, C. Trapnell, H. Pimentel, R. Kelley, and S. L. Salzberg. 2013. “TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions.” Genome Biol. 14 (4): R36.

Langmead, B., and S. L. Salzberg. 2012. “Fast gapped-read alignment with Bowtie 2.” Nat. Methods 9 (4): 357–59.

Langmead, B., C. Trapnell, M. Pop, and S. L. Salzberg. 2009. “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.” Genome Biol. 10 (3): R25.

Lawrence, M., and M. Morgan. 2014. “Scalable Genomics with R and Bioconductor.” Statistical Science 29 (2): 214–26.

Lawrence, M., R. Gentleman, and V. Carey. 2009. “rtracklayer: an R package for interfacing with genome browsers.” Bioinformatics 25 (14): 1841–2.

Lawrence, M., W. Huber, H. Pages, P. Aboyoun, M. Carlson, R. Gentleman, M. T. Morgan, and V. J. Carey. 2013. “Software for computing and annotating genomic ranges.” PLoS Comput. Biol. 9 (8): e1003118.

Li, H., and R. Durbin. 2009. “Fast and accurate short read alignment with Burrows-Wheeler transform.” Bioinformatics 25 (14): 1754–60.

———. 2010. “Fast and accurate long-read alignment with Burrows-Wheeler transform.” Bioinformatics 26 (5): 589–95.

Li, H., J. Ruan, and R. Durbin. 2008. “Mapping short DNA sequencing reads and calling variants using mapping quality scores.” Genome Res. 18 (11): 1851–8.

Liao, Y., G. K. Smyth, and W. Shi. 2013. “The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote.” Nucleic Acids Res. 41 (10): e108.

———. 2014. “featureCounts: an efficient general purpose program for assigning sequence reads to genomic features.” Bioinformatics 30 (7): 923–30.

Marco-Sola, S., M. Sammeth, R. Guigo, and P. Ribeca. 2012. “The GEM mapper: fast, accurate and versatile alignment by filtration.” Nat. Methods 9 (12): 1185–8.

Morgan, M., S. Anders, M. Lawrence, P. Aboyoun, H. Pages, and R. Gentleman. 2009. “ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data.” Bioinformatics 25 (19): 2607–8.

Pollard, M. O., D. Gurdasani, A. J. Mentzer, T. Porter, and M. S. Sandhu. 2018. “Long reads: their purpose and place.” Hum. Mol. Genet. 27 (R2): R234–R241.

Pyl, P. T., J. Gehring, B. Fischer, and W. Huber. 2014. “h5vc: scalable nucleotide tallies with HDF5.” Bioinformatics 30 (10): 1464–6.

Tippmann, S. 2015. “Programming tools: Adventures with R.” Nature 517 (7532): 109–10.

Trapnell, C., L. Pachter, and S. L. Salzberg. 2009. “TopHat: discovering splice junctions with RNA-Seq.” Bioinformatics 25 (9): 1105–11.

Wang, K., D. Singh, Z. Zeng, S. J. Coleman, Y. Huang, G. L. Savich, X. He, et al. 2010. “MapSplice: accurate mapping of RNA-seq reads for splice junction discovery.” Nucleic Acids Res. 38 (18): e178.

Zhu, Y., R. M. Stephens, P. S. Meltzer, and S. R. Davis. 2013. “SRAdb: query and use public next-generation sequencing data from within R.” BMC Bioinformatics 14 (January): 19.